Microarray Method



Field of Invention 

This invention relates to a technique for analysing cross-species microarray data.

Background of the Invention

High-density oligonucleotide microarrays (see Patent numbers WO9710365,
EP0853679, US6040138, DE69625920T) are commonly used to study global gene
expression in organisms for which complete or extensive genome sequence
information is available. In agriculture such high throughput approaches could be
used to select more environmentally friendly chemicals for plant protection and to
develop plants with increased yield, better nutritional value and increased resistance
to diseases. The technology could be used in any analysis where cross-species
hybridisation is possible, including but not limited to prokaryotic, mammalian and
plant nucleic acid sequences. However, the lack of genomic sequence information for
most crop species is a major limitation to gene expression profiling with high-density
oligonucleotide arrays.

A tailored approach to cross-species hybridisation of crop species nucleic acid to
high-density oligonucleotide arrays designed for model or extensively sequenced
organisms such as Arabidopsis is a novel solution. For the purposes of this
application, cross-species hybridisation is taken to mean the hybridisation of nucleic
acid fragments of one species with nucleic acid fragments of a related species.

The high-density oligonucleotide microarray platform uses between 11 and 20 pairs
of oligonucleotide probes, called probe sets, to provide the complete sequence of the
gene (the 'target'). The array platform comprises perfect match probes and mismatch
probes. The perfect match probes represent identical copies of the target sequence and
the mismatch probe sequences differ from the perfect match sequences at the central
nucleotide position. The mismatch probes measure non-specific binding. These
oligonucleotide DNA probes are tethered by covalent attachment to a solid support to
form an array platform (for example, the Affymetrix GeneChip®).

Typically, during gene expression profiling, the mRNA target, which has been
hybridised to probes on the array, is fluorescently labelled. The hybridisation signal,
computed as the difference between the perfect match and the mismatch probes across
a probe set, represents a measure (also referred to as the expression estimate) of gene
expression level. The hybridisation of target mRNA transcripts from a species other
than the species for which the oligonucleotide probes tethered on the array were
selected (here referred to as the extensively sequenced or 'reference species') is
termed cross-species hybridisation.



Prior to this invention several cross-species techniques (Chismar et al. Biotechniques
33 (3); 516-524 (2002): Caceres et al. Proc Natl Acad Sci USA, 100 (22); 13030-
13035, (2003): Higgins et al. Toxicological Sciences, 74 (2); 470-484 (2003): Becher
et al. The Plant J, 37; 251-268 (2004): Weber et al. The Plant J, 37; 269-281 (2004))
have been proposed; however, none of these techniques is in routine use. Most of
these techniques were designed specifically for the species under investigation and
are therefore inaccessible to most scientists. Most importantly, this prior art fails to
adequately deal with the integrity of expression estimates.

Since high-density oligonucleotide arrays use 11-20 probes to interrogate each
transcript, it is conceivable that some of the probe sequences will differ from the
cross-species sequence (i.e. the mRNA target). This poor sequence identity leads to
inefficient hybridisation between array probes and the cross-species target. These
probes, with poor sequence identity, generate weak hybridisation intensities which
lead to an attenuation of the hybridisation signals of probe sets. Computation of
expression estimates for the cross-species target using all probes (11-20) of a probe
set will therefore generate inaccurate expression estimates.
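
By way of illustration only, the attenuation described above can be sketched in
a few lines of Python. The sketch assumes the simplest reading of the expression
estimate, namely the mean of the perfect match minus mismatch differences across
a probe set; it is not the algorithm of any particular analysis software, and all
intensities are hypothetical.

```python
from statistics import mean

def expression_estimate(probe_pairs):
    """Mean PM - MM difference over (pm, mm) intensity tuples (assumed model)."""
    return mean(pm - mm for pm, mm in probe_pairs)

# Hypothetical intensities: four probe pairs that match the target well...
responsive = [(900, 100), (850, 150), (1000, 200), (950, 150)]
# ...and four whose sequence differs, so PM and MM are nearly equal.
unresponsive = [(120, 110), (100, 105), (130, 120), (110, 115)]

# Estimate using every probe pair of the probe set.
all_pairs = expression_estimate(responsive + unresponsive)
# Estimate using only the probe pairs that respond to the target.
selected_only = expression_estimate(responsive)

print(all_pairs, selected_only)
```

Under these assumed numbers the estimate computed over all pairs is markedly
lower than the estimate computed over the responsive pairs alone.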

Prior to the present invention, it has been proposed by others (Zhu et al. J Assoc. Lab. 
Automat, 6: 95-98 (2001)) that for cross-species microarray analysis, it is essential to 
pre-select usable probes by initially hybridising genomic DNA to the high-density 
oligonucleotide array. However, the prior art did not provide a rapid and effective 
method for identifying usable probes for cross-species data analysis. Probe selection 
by genomic DNA hybridisation prior to and independent of target mRNA 
hybridisation to the array, provides the most effective means of identifying usable 
probes as explained below. 

Description of the Invention 

This invention provides for a method of selecting a subset of probes, represented on a 
high-density oligonucleotide array, for use in the analysis of cross-species data, 
generated by hybridising the cross-species mRNA transcripts to the 'reference 
species' array. 

In a preferred embodiment, oligonucleotide probes can be selected by hybridising
genomic DNA or RNA to the array. Methods of isolating genomic DNA are well
known to those skilled in the art. The DNA can be labelled for hybridisation by using
the BioPrime® DNA Labelling System (Invitrogen). Instructions for use are provided
by the manufacturer. The procedure for genomic DNA hybridisation follows that for
target mRNA (i.e. antisense cRNA) hybridisation as fully described in the Patent
Application DE69625920T but without the use of the cRNA hybridisation controls
(Bio B, Bio C, Bio D and Cre). After staining and scanning the array as described in
the Patent Application DE69625920T, the software, which is described in detail in, for
example, U.S. Pat. Nos. 5,547,839, 5,578,832 and 5,631,734, generates a hybridisation
intensity file (CEL) containing statistics of each probe on the array, e.g., the 75th
percentile of intensities, standard deviation of pixel intensities and probe co-ordinates
which represent the physical location of the probes on the array.

A significant part of this invention is the construction of a series of programs using
the scripting computer language Perl (see e.g., Wall, Christiansen, and Orwant,
Programming Perl, 3rd Ed., O'Reilly and Associates (2000)), although equivalent
scripts and programs can readily be developed in other computing languages by a
person skilled in the art (including but not limited to C++, Java, Visual Basic). First, a
perl script is constructed to extract probe co-ordinates with hybridisation intensity
above background. The term background refers to non-specific hybridisation or other
interactions between the hybridising target and components of the array. Fluorescence
of the array components may also contribute to background. In one aspect of this
preferred embodiment, background is calculated as the mean intensity of negative
control probes. These negative control probes (Bio B, Bio C, Bio D, Cre etc.) are
oligonucleotide probes selected from species other than the 'reference species'
organism or the cross-species organism.
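
The first selection step can be sketched as follows. This is a minimal
illustration in Python rather than the script itself: it assumes the intensities
have already been read from the CEL file into a mapping from (x, y) co-ordinates
to intensity, it skips the control probes themselves when reporting the result
(an assumption, not stated in the text), and the function name and all values
are hypothetical.

```python
from statistics import mean

def select_above_background(intensities, control_coords):
    """Return (background, co-ordinates with intensity above background).

    intensities    : dict mapping (x, y) -> hybridisation intensity
    control_coords : co-ordinates of the negative control probes
    """
    control_coords = set(control_coords)
    # Background is taken as the mean intensity of the negative controls.
    background = mean(intensities[c] for c in control_coords)
    # Keep every non-control probe whose intensity exceeds the background.
    selected = {coord for coord, value in intensities.items()
                if value > background and coord not in control_coords}
    return background, selected

# Hypothetical genomic DNA hybridisation intensities.
intensities = {
    (0, 0): 40, (1, 0): 60,    # negative control probes
    (2, 0): 500, (2, 1): 45,   # responsive PM, quiet MM
    (3, 0): 30, (3, 1): 300,   # quiet PM, MM that hybridises better
}
controls = {(0, 0), (1, 0)}

background, selected = select_above_background(intensities, controls)
print(background, sorted(selected))
```

Note that the result may still contain mismatch probes, such as (3, 1) in this
made-up data, which is why a further filtering step is needed.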

The oligonucleotide probe co-ordinates selected by the perl script include both perfect 
match and mismatch probes. In a cross-species genomic DNA hybridisation it is 
conceivable that some of the mismatch probes will hybridise more efficiently to the 
cross-species target DNA sequence. 

In a further aspect of this embodiment of the invention, a second perl script is
developed to eliminate mismatch probes with hybridisation intensity above
background and with higher hybridisation intensity than perfect match probes. To
achieve this, the perl script uses as input a file consisting of only perfect match
co-ordinates and the output file of the first perl script. The output file generated in
this embodiment consists of only perfect match probe co-ordinates with hybridisation
intensities greater than the estimated background for the genomic DNA hybridisation.
These perfect match oligonucleotide probes, which share high sequence similarity
with the cross-species target, represent probes selected for the analysis of
cross-species transcripts hybridised to the 'reference species' array.
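
A minimal Python sketch of this second step is given below. It assumes, purely
for illustration, that both input files hold one probe per line as tab-separated
x and y co-ordinates; the actual file layouts are not specified here, and the
function names are hypothetical.

```python
def read_coords(lines):
    """Parse 'x<TAB>y' lines (assumed layout) into a set of (x, y) tuples."""
    coords = set()
    for line in lines:
        line = line.strip()
        if line:
            x, y = line.split("\t")
            coords.add((int(x), int(y)))
    return coords

def keep_perfect_match(above_background, perfect_match):
    """Keep only co-ordinates that are both above background and PM probes."""
    return above_background & perfect_match

# Hypothetical contents of the two input files.
first_script_output = ["2\t0", "3\t1", "4\t1"]  # probes above background
pm_file = ["2\t0", "3\t0", "4\t0"]              # all perfect match probes

selected_pm = keep_perfect_match(read_coords(first_script_output),
                                 read_coords(pm_file))
print(sorted(selected_pm))
```

Taking the intersection of the two sets removes every mismatch co-ordinate in
one pass, including mismatch probes that hybridised more strongly than their
perfect match partners.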

The selection of the corresponding mismatch oligonucleotide probes is carried out as 
described in the third aspect of the invention. 

In a third aspect of the preferred embodiment of the invention, a third perl script is
constructed to complete the process of generating a chip description file (CDF) for the
cross-species organism. This perl script uses as input the output (selected perfect
match probes) of the previous script and the chip description file of the 'reference
species' organism. The X co-ordinate for both perfect match and mismatch probes for
each gene sequence on the array is identical. This information is coded into the perl
script to enable the selection of both perfect match and mismatch probes from the
'reference species' organism's CDF to construct a new chip description file for the
cross-species. In this same aspect, the probes excluded from CDF construction can be
used to construct a probe sensitivity index (PSI) file. The PSI file can be used to train
a large data set in order to ensure consistency of data stored in a database.
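
The third step can be sketched in Python as below. The sketch does not parse a
real CDF; it assumes a simplified, hypothetical in-memory representation in which
each probe set ID maps to a list of (perfect match, mismatch) co-ordinate pairs.
The probe set names and co-ordinates are invented for the example.

```python
def build_cross_species_cdf(reference_cdf, selected_pm):
    """Split the reference probe pairs into kept and excluded pairs.

    reference_cdf : dict mapping probe set ID -> list of (pm_xy, mm_xy)
    selected_pm   : set of PM co-ordinates chosen by the previous step
    """
    new_cdf, excluded = {}, {}
    for probe_set, pairs in reference_cdf.items():
        for pm_xy, mm_xy in pairs:
            # PM and MM of a pair share the same X co-ordinate.
            if pm_xy[0] != mm_xy[0]:
                raise ValueError(f"PM/MM X mismatch in {probe_set}")
            target = new_cdf if pm_xy in selected_pm else excluded
            target.setdefault(probe_set, []).append((pm_xy, mm_xy))
    return new_cdf, excluded

# Hypothetical reference CDF with two probe sets.
reference_cdf = {
    "set_A": [((2, 0), (2, 1)), ((3, 0), (3, 1)), ((4, 0), (4, 1))],
    "set_B": [((5, 0), (5, 1))],
}
selected_pm = {(2, 0), (4, 0)}

new_cdf, excluded = build_cross_species_cdf(reference_cdf, selected_pm)
print(new_cdf)
print(excluded)
```

In this sketch the excluded pairs are returned alongside the new probe set
definitions, since they are the raw material for a probe sensitivity index file.
A probe set with no selected probes, such as set_B here, simply does not appear
in the new description.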

All software used in the computation of gene expression levels requires a hybridisation
intensity file (CEL) of the type described above and a chip description file (CDF) for
the array type carrying the hybridised target transcripts. The chip description file is a
library file consisting of gene (probe set) IDs, the corresponding co-ordinates of their
probe sequences on the array and other software parameters.

In a second embodiment of the invention, a BLAST (Altschul et al., J. Mol. Biol.
215; 403-410 (1990)) output file is generated in silico by comparing cross-species
nucleic acid sequences in a database to nucleic acid sequences represented on the
oligonucleotide array. This output file is parsed with another perl script to identify
oligonucleotide probes with 100% sequence identity to the cross-species sequence.
The probes selected can then be used to construct a chip description file for the
cross-species organism as described above.
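
The parsing step of this second embodiment can be sketched in Python as follows.
Several assumptions are made that are not stated in the text: the BLAST output is
in the standard 12-column tabular layout with the probe as the query sequence,
probes are 25 nucleotides long, and a probe counts as selected only when the 100%
identity extends over its full length. All sequence names are hypothetical.

```python
PROBE_LENGTH = 25  # assumed probe length; adjust for the array in use

def perfect_match_probes(blast_lines, probe_length=PROBE_LENGTH):
    """Return IDs of probes with a full-length, 100% identical hit.

    Expects tab-separated rows whose first four columns are:
    query ID, subject ID, percent identity, alignment length.
    """
    selected = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        probe_id = fields[0]
        identity = float(fields[2])
        aligned = int(fields[3])
        if identity == 100.0 and aligned == probe_length:
            selected.add(probe_id)
    return selected

# Hypothetical tabular BLAST rows.
rows = [
    ["probe_1", "seq_A", "100.00", "25", "0", "0", "1", "25", "101", "125", "1e-06", "50.1"],
    ["probe_2", "seq_B", "96.00", "25", "1", "0", "1", "25", "40", "64", "1e-04", "42.1"],
    ["probe_3", "seq_C", "100.00", "18", "0", "0", "1", "18", "7", "24", "0.5", "36.2"],
]
blast_lines = ["\t".join(row) for row in rows]

selected = perfect_match_probes(blast_lines)
print(sorted(selected))
```

The alignment length check is included because a short, partial alignment can
report 100% identity even though the whole probe does not match.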

The novelty of the invention in its two embodiments lies in the informed
selection of probe pairs (putative or actual match and mismatch) to allow high
throughput genome-wide screening of the expression patterns of genes from genomes
of species related to but not identical to the genome of at least one reference
extensively sequenced species such as, but not limited to, Arabidopsis, tomato, rice,
mouse, human, C. elegans, Bacillus sp., Drosophila, chimpanzee, chicken, SARS
virus etc.

This invention is a major advancement in microarray analysis of cross-species data, in
particular for species where sequence information is not yet available. The reduction
to practice of this invention means it is readily accessible to scientists investigating
global gene expression in, for example, Brassica crops (oilseed rape, broccoli, cabbage,
cauliflower, Brussels sprouts, radish, horseradish etc.), but the invention is applicable
to any species with significant synteny to an extensively sequenced species.



The invention is demonstrated in the following non-limiting examples.

Examples

Figure 1 shows gene expression levels computed with GeneChip® software (available
from Affymetrix Inc., Santa Clara, CA, USA).

There are three main columns. The first column (without title) consists of gene (probe
set) IDs. The second and the third columns, which are each subdivided into two columns
denoting number of probe pairs (Stat Pairs) and expression levels (Signal), represent a
sample (Bo+P) analysed with the model organism CDF (ATH1 - Arabidopsis
thaliana) and the cross-species CDF (Brassica).

Note the high expression values generated with the cross-species CDF.
Since the mean of perfect match and mismatch probe pairs in a probe set is output by
the software as the expression estimate of transcripts, probes in the ATH1 CDF which
are not responsive to Brassica transcripts will lead to an attenuation of signal for the
probe set, generating an inaccurate expression estimate of the transcript. In a typical
experiment the background signal is usually less than 100. Therefore transcripts with
expression levels below background are usually identified as undetectable in the
sample being interrogated.



Figure 2 shows dChip (Li and Wong, Proc. Natl. Acad. Sci. USA, 98; 31-36 (2001))
plots of probe response patterns for the probe set 246745_at. The black
graph (PM arrows) represents perfect match probe intensities and the grey graph (MM
arrows) represents mismatch probe intensities. There are two grids, A and B, one
depicting the probe response pattern for 246745_at on an array (ATH1-12501) hybridized
(see Patent No DE69625920T for all hybridization conditions) with an Arabidopsis
(B1798 repl) cRNA target (grid A) and the other the probe response pattern for
246745_at on the same array type hybridized with a Brassica (Bo + P_repl) cRNA target
(grid B). There are 11 probe pairs in each grid; for grid A all eleven probe pairs
respond actively to the Arabidopsis target. However, for the Brassica (cross-species)
target only a subset (see grid B) of the 11 probes respond to the target. This is
evident from the fact that for probe pairs 1-7 (grid B) the PM and MM curves are
indistinguishable. Ymax for the graphs represents the highest probe signal intensity.
These unresponsive probes were eliminated from the Arabidopsis (ATH1) chip
description file to construct the cross-species (Brassica) chip description file.

ABSTRACT



Microarray Data Analysis Technique 



This invention provides a novel method for the analysis of cross-species microarray
data generated with, for example, GeneChip® technology. Essentially, oligonucleotide
probes are pre-selected from, for example, the GeneChip® array by way of
cross-species genomic DNA hybridisation to the GeneChip® array. Probe selection
through database searches with sequences represented on, for example, GeneChip®
arrays is an alternative. The selected probes are used to construct a library file for
gene expression software designed for the computation of gene expression levels.